Home Projects Bulk Messaging System Python Backend Services Contact Extraction Utilities TXT File Extraction

TXT File Extraction

Referenced Files

app.py extract_contacts.py parse_manual_numbers.py validate_number.py requirements.txt README.md

Table of Contents#

Introduction
Project Structure
Core Components
Architecture Overview
Detailed Component Analysis
Dependency Analysis
Performance Considerations
Troubleshooting Guide
Conclusion

Introduction#

This document provides comprehensive documentation for TXT file contact extraction capabilities within the Bulk Messaging System. It focuses on the regex-based phone number detection algorithm, multi-separator splitting logic, name extraction when names are combined with phone numbers, supported TXT formats, mixed format handling, edge cases, and fallback parsing strategies.

Project Structure#

The TXT extraction functionality is implemented in two primary locations:

A Flask-based API service that handles file uploads and contact extraction
Standalone Python utilities for direct command-line usage

graph TB subgraph "Python Backend" API["Flask API
app.py"] EXTRACT["Contact Extraction
extract_contacts.py"] MANUAL["Manual Numbers Parser
parse_manual_numbers.py"] VALIDATE["Phone Validator
validate_number.py"] end subgraph "External Dependencies" PANDAS["pandas"] OPENPYXL["openpyxl"] XLRD["xlrd"] FLASK["flask"] CORS["flask-cors"] end API --> EXTRACT API --> MANUAL API --> VALIDATE EXTRACT --> PANDAS API --> PANDAS API --> OPENPYXL API --> XLRD API --> FLASK API --> CORS

Diagram sources

Section sources

Core Components#

The TXT extraction system consists of several key components working together:

Phone Number Cleaning Algorithm#

The core phone number cleaning function performs systematic normalization:

Removes common separators (hyphens, spaces, parentheses, periods)
Strips non-digit characters except plus signs
Handles country code detection and normalization
Validates length constraints (7-15 digits)

TXT File Processing Pipeline#

The TXT extraction follows a multi-stage approach:

Line-by-line processing with UTF-8 encoding
Multi-separator splitting using comma, semicolon, tab, and pipe delimiters
Pattern-based phone number detection using regex
Fallback extraction when separators are absent
Name extraction from remaining parts

Fallback Parsing Strategies#

When initial parsing fails, the system employs progressive fallback mechanisms:

Separator-based splitting with multiple delimiter support
Whole-line regex matching for phone numbers
Name extraction from remaining text segments
Graceful error handling and empty line skipping

Section sources

Architecture Overview#

The TXT extraction architecture implements a layered approach with robust error handling and fallback mechanisms.

sequenceDiagram participant Client as "Client Application" participant API as "Flask API" participant TXT as "TXT Processor" participant CLEAN as "Phone Cleaner" participant FS as "File System" Client->>API : POST /upload (TXT file) API->>FS : Save uploaded file API->>TXT : extract_contacts_from_txt(file_path) TXT->>FS : Read file lines loop For each line TXT->>TXT : Split by separators TXT->>TXT : Detect phone number candidate TXT->>TXT : Extract name candidate TXT->>CLEAN : Clean phone number CLEAN-->>TXT : Normalized number TXT->>TXT : Validate length constraints end TXT-->>API : List of contacts API-->>Client : JSON response with contacts Note over TXT,CLEAN : Fallback processing when initial parsing fails

Diagram sources

Detailed Component Analysis#

TXT Extraction Algorithm#

The TXT extraction algorithm implements sophisticated pattern matching and parsing logic.

flowchart TD Start([Start TXT Processing]) --> ReadFile["Read file lines"] ReadFile --> ProcessLine["Process each line"] ProcessLine --> CheckEmpty{"Is line empty?"} CheckEmpty --> |Yes| NextLine["Skip to next line"] CheckEmpty --> |No| SplitParts["Split by separators:
[, ; \\t |]"] SplitParts --> FindPhone["Find phone candidates
using regex pattern"] FindPhone --> HasPhone{"Phone candidate found?"} HasPhone --> |Yes| ExtractName["Extract name from remaining parts"] HasPhone --> |No| FallbackRegex["Apply fallback regex
on entire line"] FallbackRegex --> FallbackFound{"Fallback match found?"} FallbackFound --> |Yes| ExtractFromLine["Extract from whole line"] FallbackFound --> |No| NextLine ExtractFromLine --> CleanNumber["Clean phone number"] ExtractName --> CleanNumber CleanNumber --> ValidateLength{"Validate length
(7-15 digits)"} ValidateLength --> |Valid| AddContact["Add to contacts list"] ValidateLength --> |Invalid| NextLine AddContact --> NextLine NextLine --> End([End])

Diagram sources

Phone Number Detection Patterns#

The system uses sophisticated regex patterns for phone number identification:

Primary Detection Pattern#

The main pattern [\d+\-\(\)\s]{7,} identifies phone numbers by:

Matching digits (\d+)
Including plus signs (+)
Allowing hyphens (\-)
Permitting parentheses (\(\))
Supporting spaces (\s)
Requiring minimum 7 characters for validation

Fallback Detection Pattern#

The fallback pattern [\+]?[\d\-\(\)\s]{7,} handles:

Optional leading plus sign
Flexible digit and separator combinations
Whole-line matching when separators are absent

Multi-Separator Splitting Logic#

The system supports four primary separators with equal precedence:

graph LR Input["Input Line"] --> Split["re.split(r'[,;\\t|]', line)"] Split --> Part1["Part 1"] Split --> Part2["Part 2"] Split --> Part3["Part 3"] Split --> PartN["Part N"] Part1 --> Check1{"Looks like phone?
regex test"} Part2 --> Check2{"Looks like phone?"} Part3 --> Check3{"Looks like phone?"} PartN --> CheckN{"Looks like phone?"} Check1 --> |Yes| Phone1["Select as phone candidate"] Check2 --> |Yes| Phone2["Select as phone candidate"] Check3 --> |Yes| Phone3["Select as phone candidate"] CheckN --> |Yes| PhoneN["Select as phone candidate"] Check1 --> |No| Name1["Select as name candidate"] Check2 --> |No| Name2["Select as name candidate"] Check3 --> |No| Name3["Select as name candidate"] CheckN --> |No| NameN["Select as name candidate"]

Diagram sources

Name Extraction Process#

When names are combined with phone numbers, the system implements intelligent extraction:

Mixed Format Handling#

The algorithm prioritizes:

First phone candidate: Selected when multiple phone-like segments exist
First non-empty segment: Used as name when no clear phone candidate exists
Fallback extraction: When separators are absent, the system extracts from the entire line

Name Candidate Selection#

flowchart TD Parts["Split parts"] --> LoopParts["Iterate through parts"] LoopParts --> CheckPhone{"Part matches phone pattern?"} CheckPhone --> |Yes| CheckExistingPhone{"Phone candidate exists?"} CheckExistingPhone --> |No| SetPhone["Set as phone candidate"] CheckExistingPhone --> |Yes| NextPart["Next part"] CheckPhone --> |No| CheckName{"Name candidate exists?"} CheckName --> |No| SetName["Set as name candidate"] CheckName --> |Yes| NextPart SetPhone --> NextPart SetName --> NextPart NextPart --> MoreParts{"More parts?"} MoreParts --> |Yes| LoopParts MoreParts --> |No| Done["Extraction complete"]

Diagram sources

Supported TXT Formats and Examples#

Standard Separated Format#

text

John Doe,123-456-7890
Jane Smith;john.doe@gmail.com|+1-555-0123
Bob Wilson|+44 20 7946 0958
Alice Brown,555.123.4567

Mixed Format Handling#

text

+1-555-0123 John Smith
5551234567 Jane Doe
+442079460958 Bob Wilson

Edge Case Examples#

text

John Doe,123-456-7890,,Extra Field
  
123-456-7890
Jane Smith

Error Recovery Mechanisms#

The system implements comprehensive error handling:

File Processing Errors#

UTF-8 encoding enforcement
Graceful handling of unreadable files
Empty file and directory handling

Parsing Failures#

Line-by-line processing with individual error isolation
Empty line skipping
Partial parsing continuation on errors

Phone Number Validation#

Length validation (7-15 digits)
Format normalization
Country code detection and correction

Section sources

Dependency Analysis#

The TXT extraction system relies on several key dependencies:

graph TB subgraph "Core Dependencies" RE["re (regex)"] OS["os"] SYS["sys"] JSON["json"] end subgraph "Optional Dependencies" PANDAS["pandas"] OPENPYXL["openpyxl"] XLRD["xlrd"] end subgraph "External Libraries" FLASK["flask"] CORS["flask-cors"] WERKZEUG["werkzeug"] end TXT_EXTRACT["TXT Extraction"] --> RE TXT_EXTRACT --> OS TXT_EXTRACT --> SYS TXT_EXTRACT --> JSON CSV_EXTRACT["CSV Extraction"] --> PANDAS EXCEL_EXTRACT["Excel Extraction"] --> PANDAS EXCEL_EXTRACT --> OPENPYXL EXCEL_EXTRACT --> XLRD API["Flask API"] --> FLASK API --> CORS API --> WERKZEUG API --> TXT_EXTRACT API --> CSV_EXTRACT API --> EXCEL_EXTRACT

Diagram sources

External Dependencies Impact#

pandas: Enables structured data processing for CSV/Excel files
openpyxl/xlrd: Provides Excel file format support
flask/flask-cors: Powers the web API interface
werkzeug: Handles file uploads and security

Section sources

Performance Considerations#

The TXT extraction system is optimized for efficiency and scalability:

Algorithm Complexity#

Time Complexity: O(n × m) where n is number of lines and m is average parts per line
Space Complexity: O(k) where k is number of valid contacts extracted
Memory Usage: Linear with respect to file size

Optimization Strategies#

Single-pass line processing
Early termination on empty lines
Minimal regex operations per line
Efficient string operations for phone number cleaning

Scalability Factors#

File size limitations (16MB max upload)
Memory constraints for large files
Regex compilation caching
Streaming file processing

Troubleshooting Guide#

Common Issues and Solutions#

Phone Number Not Detected#

Symptoms: Phone numbers appear as empty or invalid Causes:

Numbers shorter than 7 digits or longer than 15 digits
Unrecognized separators or formatting
Leading zeros in international numbers

Solutions:

Ensure numbers meet length requirements
Use recognized separators (spaces, hyphens, parentheses)
Include country codes for international numbers

Mixed Format Problems#

Symptoms: Names incorrectly extracted or phone numbers missed Causes:

Ambiguous separators causing misinterpretation
Names containing phone number patterns
Missing separators in lines

Solutions:

Use consistent separator usage
Place phone numbers first when mixing formats
Include explicit separators between name and number

File Encoding Issues#

Symptoms: Characters appear corrupted or parsing fails Causes:

Non-UTF-8 file encoding
Special characters not properly handled
BOM (Byte Order Mark) interference

Solutions:

Save files in UTF-8 encoding
Remove BOM if present
Verify character encoding compatibility

Performance Issues#

Symptoms: Slow processing for large files Causes:

Very large file sizes exceeding limits
Complex regex patterns
Memory constraints

Solutions:

Split large files into smaller chunks
Optimize regex patterns
Monitor memory usage during processing

Error Codes and Messages#

The system provides structured error reporting:

File not found errors
Unsupported file type errors
Processing exceptions with detailed messages
Validation failures with specific reasons

Section sources

Conclusion#

The TXT file extraction system provides robust, flexible contact processing with sophisticated regex-based phone number detection and intelligent name extraction. Its multi-layered approach ensures reliable parsing across various formats while maintaining strong error handling and performance characteristics. The system successfully handles edge cases, mixed formats, and provides comprehensive fallback mechanisms for maximum compatibility with diverse contact data sources.

The implementation demonstrates best practices in:

Progressive parsing with multiple fallback strategies
Comprehensive error handling and recovery
Efficient regex pattern matching
Flexible separator support
International phone number normalization

This foundation enables reliable bulk messaging operations while maintaining data integrity and user experience across different contact data formats.

Previous Excel File Extraction

Next Flask API Endpoints

Bulk Messaging System

API Reference

Contact Management System

Desktop Application Architecture

Gmail API Integration

Python Backend Services

SMTP Email Integration

Security Implementation

Whats App Integration

TXT File Extraction

Table of Contents#

Introduction#

Project Structure#

Core Components#

Phone Number Cleaning Algorithm#

TXT File Processing Pipeline#

Fallback Parsing Strategies#

Architecture Overview#

Detailed Component Analysis#

TXT Extraction Algorithm#

Phone Number Detection Patterns#

Primary Detection Pattern#

Fallback Detection Pattern#

Multi-Separator Splitting Logic#

Name Extraction Process#

Mixed Format Handling#

Name Candidate Selection#

Supported TXT Formats and Examples#

Standard Separated Format#

Mixed Format Handling#

Edge Case Examples#

Error Recovery Mechanisms#

File Processing Errors#

Parsing Failures#

Phone Number Validation#

Dependency Analysis#

External Dependencies Impact#

Performance Considerations#

Algorithm Complexity#

Optimization Strategies#

Scalability Factors#

Troubleshooting Guide#

Common Issues and Solutions#

Phone Number Not Detected#

Mixed Format Problems#

File Encoding Issues#

Performance Issues#

Error Codes and Messages#

Conclusion#